My notes on the Actor-Critic algorithm.
| Symbol | Definition |
|---|---|
| $s \in S$ | $s$ denotes a state. |
| $a \in A$ | $a$ denotes an action. |
| $r \in R$ | $r$ denotes a reward. |
| $ \pi(a \vert s) $ | Policy function; returns the probability of choosing action $a$ in state $s$. |
| $V(s)$ | State-value function; measures how good a state is (in terms of expected return). |
| $V^\pi (s)$ | State-value function when we follow policy $\pi$. |
| $Q^\pi$ | Action-value function; measures how good an action is. |
| $Q^\pi (s, a)$ | Action-value function; how good it is to take action $a$ in state $s$ when we follow policy $\pi$. |
| $\gamma$ | Discount factor. |
| $G_t$ | Total (discounted) return from time step $t$. |
In the traditional policy gradient method, we have one neural network for our policy, and after each episode we update its parameters $\theta$. The main idea behind the Actor-Critic method is: why not use a value function to help with our policy updates? The claim is that the value function (the critic) reduces the variance of the policy gradient estimate. 🤷
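To make this concrete, here is the standard way to write the actor's gradient estimate when a critic is used (my summary in the notation of the table above, not a quote from any particular source):

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \vert s_t)\, A^\pi(s_t, a_t), \qquad A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t).$$

Subtracting the baseline $V^\pi(s_t)$ does not change the expected value of the gradient, but it lowers its variance compared to using the raw return $G_t$. In the code below, the advantage is approximated by the discounted return minus the critic's estimate $V(s_t)$.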
In batch mode, you gather lots of data by interacting with the environment and sampling from policy $\pi$, and then update your parameters using all the information you gathered.
In online mode, you run your policy and update your parameters after every single interaction with the environment. In practice we rarely use this mode as-is, because single-sample updates are very noisy for neural networks trained with SGD. If you manage to parallelize the simulation, this mode becomes practical.
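Here is a toy sketch of the difference (my addition, not the notebook's code): the tiny linear `policy`, the random states, and `compute_loss` are hypothetical stand-ins, just to show where the gradient step happens in each mode.

# Toy sketch: where the parameter update happens in online vs. batch mode.
import torch
import torch.nn as nn

policy = nn.Linear(4, 2)                       # stand-in for a policy network
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

def compute_loss(logits):
    # placeholder for the real actor-critic loss (-log_prob * advantage + value loss)
    return logits.sum()

# Online mode: one parameter update after every single transition
for step in range(3):
    state = torch.randn(4)                     # stand-in for an environment observation
    loss = compute_loss(policy(state))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Batch mode: collect a whole episode first, then one update on the summed loss
losses = []
for step in range(3):
    state = torch.randn(4)
    losses.append(compute_loss(policy(state)))
opt.zero_grad()
torch.stack(losses).sum().backward()
opt.step()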
If you don't understand the above formulation, here is another one for you:
In this algorithm we have two separate learning rates: $\alpha_\theta$ for the actor (policy parameters $\theta$) and $\alpha_w$ for the critic (value-function parameters $w$).
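One straightforward way to get two learning rates in PyTorch is to use one optimizer per network. This is a minimal sketch, assuming the same layer sizes as the networks below; the learning-rate values are just example numbers, and the actual notebook code ends up using a single Adam optimizer over a shared network instead.

import torch.nn as nn
import torch.optim as optim

actor = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 1))

alpha_theta = 1e-3    # actor learning rate (example value)
alpha_w = 1e-2        # critic learning rate (example value)
actor_optimizer = optim.Adam(actor.parameters(), lr=alpha_theta)
critic_optimizer = optim.Adam(critic.parameters(), lr=alpha_w)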
When implementing the Actor-Critic algorithm, there are two major network designs. The first is to use two separate networks, one for fitting the value function and the other for the policy (the ActorCriticTwoNetworks class below). This design is very simple and easy to implement. However, in some cases your state contains the raw pixels of the simulation, and with two separate networks each of them must learn similar features and properties on its own. We can avoid this duplication by using one neural network with two heads (the ActorCritic class below).
This is just a re-hash of what's already out there, nothing new per se.
In [1]:
# Import all packages we want to use
import numpy as np
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
CartPole-v1
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
| Property | Default | Note |
|---|---|---|
| Max Episode Length | 500 | Check out this line |
| Action Space | +1, -1 | The system is controlled by applying a force of +1 or -1 to the cart |
| Default reward | +1 | A reward of +1 is provided for every time-step that the pole remains upright |
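A quick way to confirm these properties from code (my addition; it assumes the classic `gym` API that the rest of this notebook uses):

import gym

env = gym.make('CartPole-v1')
print(env.observation_space)         # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)              # Discrete(2): action 0 pushes the cart left, action 1 pushes it right
print(env.spec.max_episode_steps)    # 500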
In [2]:
# Preparing the Cart Pole
env = gym.make('CartPole-v1')
env.seed(0)
torch.manual_seed(0)
gamma = 0.99
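As a quick worked example of what $\gamma = 0.99$ means (a standard definition, added here for reference): the discounted return from time step $t$ is

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$

so a reward received 100 steps in the future is weighted by $0.99^{100} \approx 0.37$, whereas with $\gamma = 0.9$ it would be weighted by $0.9^{100} \approx 3 \times 10^{-5}$.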
In [3]:
class ActorCriticTwoNetworks(nn.Module):
    def __init__(self):
        super(ActorCriticTwoNetworks, self).__init__()
        # Critic Network
        self.critic = nn.Sequential(
            nn.Linear(4, 128),
            nn.ReLU(),
            nn.Linear(128, 1))
        # Actor Network
        self.actor = nn.Sequential(
            nn.Linear(4, 128),
            nn.ReLU(),
            nn.Linear(128, 2),
            nn.Softmax(dim=-1))

    def forward(self, x):
        value = self.critic(x)
        action_scores = self.actor(x)
        dist = Categorical(action_scores)
        return dist, value
In [4]:
class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        self.layer1 = nn.Linear(4, 128)
        self.policy_head = nn.Linear(128, 2)
        self.value_head = nn.Linear(128, 1)
        self.saved_actions = []
        self.rewards = []

    def forward(self, x):
        x = F.relu(self.layer1(x))
        action_scores = self.policy_head(x)
        state_value = self.value_head(x)
        return F.softmax(action_scores, dim=-1), state_value
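As a quick sanity check (my addition, not part of the original training code, and relying on the imports from the first cell): a dummy 4-dimensional state should give 2 action probabilities that sum to 1 and a single-element state value.

_check_model = ActorCritic()
_probs, _value = _check_model(torch.zeros(4))
print(_probs, _probs.sum(), _value)   # two probabilities summing to ~1.0, and a 1-element value tensor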
In [5]:
# Create an instance of our model.
# We are using CPU only mode, for CUDA mode, call .cuda() after ActorCritic()
model = ActorCritic()
optimizer = optim.Adam(model.parameters(), lr=3e-2)
# Retrieving machine epsilon for float32 using numpy's built-ins
# (used later to avoid division by zero when normalizing rewards)
eps = np.finfo(np.float32).eps.item()
In [6]:
from collections import namedtuple
# A Container to save both log_probs and value after each interaction with the environment.
SavedAction = namedtuple('SavedAction', ['log_prob', 'value'])
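Just to illustrate how this container is used later (my addition; the tensor values are made up), the fields are accessed by name in the loss loop:

_example = SavedAction(log_prob=torch.tensor(-0.69), value=torch.tensor([0.5]))
print(_example.log_prob, _example.value)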
In [7]:
def select_action(state):
    # Convert current state to float tensor
    state = torch.from_numpy(state).float()
    # Run our model and get action probs and state value
    probs, state_value = model(state)
    # Using Categorical helper for sampling and log_probs
    m = Categorical(probs)
    selected_action = m.sample()
    # storing some data for backpropagation
    model.saved_actions.append(SavedAction(m.log_prob(selected_action), state_value))
    # converting tensor to python scalar and returning it
    return selected_action.item()
In [8]:
def model_optimization_step():
    R = 0
    saved_actions = model.saved_actions
    policy_losses = []
    value_losses = []
    rewards = []
    # Discounted return calculation: R_t = r_t + gamma * R_{t+1}, computed backwards
    for r in model.rewards[::-1]:
        R = r + gamma * R
        rewards.insert(0, R)
    # Converting rewards to Tensor
    rewards = torch.tensor(rewards)
    # Normalizing returns to zero mean and unit variance (eps avoids division by zero)
    rewards = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Going through actions/returns to calculate losses
    # We are doing batch-mode Actor-Critic
    for (log_prob, value), r in zip(saved_actions, rewards):
        # calculating the advantage: discounted return minus the critic's value estimate
        advantage = r - value.item()
        # Policy loss: -A * log(pi(a|s))
        policy_losses.append(-log_prob * advantage)
        # Value-function loss: smooth L1 (Huber) loss between V(s) and the discounted return
        value_losses.append(F.smooth_l1_loss(value, torch.tensor([r])))
    optimizer.zero_grad()
    # Total loss is the sum of the two losses: policy loss and value-function loss
    loss = torch.stack(policy_losses).sum() + torch.stack(value_losses).sum()
    # Call backward and take an optimization step on the entire batch (one full episode)
    loss.backward()
    optimizer.step()
    # Preparing the model for the next episode: removing the current episode's data
    del model.rewards[:]
    del model.saved_actions[:]
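Written out, the per-episode objective the step above minimizes is (my summary of the code, with $\hat{G}_t$ the normalized discounted return and $\mathrm{huber}$ the smooth L1 loss):

$$\mathcal{L} = \sum_t \Big[ -\log \pi_\theta(a_t \vert s_t)\,\big(\hat{G}_t - V(s_t)\big) + \mathrm{huber}\big(V(s_t),\, \hat{G}_t\big) \Big].$$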
In [9]:
def train(num_episodes):
    # Length of each episode
    ep_history = []
    for current_episode in range(num_episodes):
        # Resetting the Environment
        state = env.reset()
        # Gathering data, with a max of 500 steps per episode
        for t in range(500):
            action = select_action(state)
            state, reward, done, _ = env.step(action)
            model.rewards.append(reward)
            if done:
                break
        # Episode length is t + 1 (t is the index of the last step)
        ep_history.append(t + 1)
        # Optimize our policy after gathering a full episode
        model_optimization_step()
        # Logging
        if (current_episode + 1) % 50 == 0:
            print('Episode {}\tLast Episode length: {:5d}'.format(current_episode + 1, t + 1))
    return ep_history
In [10]:
episodes_to_train = 1000
ep_history = train(episodes_to_train)
In [11]:
# Making plots larger!
matplotlib.rcParams['figure.figsize'] = [15, 10]
# X Axis of the plots
xx = range(episodes_to_train)
plt.plot(xx, ep_history, '.-')
plt.title('Episode Length')
plt.ylabel('Length of each Episode')
plt.show()